Lexical Gaps and Lexicalization: Implications for Word Segmentation Systems for Chinese NLP
Abstract
This paper is motivated by the observation that not all Chinese adjectives have a canonical antonym. For example, most Chinese speakers translate the English word dishonest as the word string bu chengshi ‘not honest’ rather than as any of the antonyms of chengshi suggested in antonym dictionaries. Discourse evidence from corpus data suggests that bu chengshi is evolving into a word faster than some other ‘bu + adjective’ strings, which may result from the lexical gap left by the absence of a canonical antonym for chengshi and the communicative need for such a word. We therefore propose that, if the lexicalization of bu chengshi continues, the string may need to be treated as a single word in a segmentation system (i.e., buchengshi ‘dishonest’). For a segmentation system to distinguish words from phrases, discourse factors should be taken into account.
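The segmentation consequence described in the abstract can be illustrated with a minimal sketch. It assumes a dictionary-based forward maximum matching segmenter, which is a generic textbook technique and not necessarily the kind of system the paper has in mind. The function name `forward_max_match`, the toy lexicons, and the example sentence (using the characters 不诚实 for bu chengshi) are all hypothetical and chosen only for illustration.

```python
def forward_max_match(text, lexicon):
    """Greedy longest-match segmentation.

    Characters not covered by any lexicon entry become single-character tokens.
    """
    max_len = max(map(len, lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon hit
        # (or fall back to a single character).
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicons: one treats "bu chengshi" as a phrase,
# the other lists "buchengshi" as a lexicalized word.
phrase_lexicon = {"他", "不", "诚实"}
word_lexicon = phrase_lexicon | {"不诚实"}

sentence = "他不诚实"  # 'he is not honest / dishonest'

print(forward_max_match(sentence, phrase_lexicon))  # ['他', '不', '诚实']
print(forward_max_match(sentence, word_lexicon))    # ['他', '不诚实']
```

In this sketch the only difference between the two outputs is whether the lexicon contains the combined entry, so any decision to add it would have to come from outside the segmenter, for instance from the kind of discourse evidence the abstract refers to.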
Related resources
First Language Activation during Second Language Lexical Processing in a Sentential Context
Lexicalization patterns, the way words are mapped onto concepts, differ from one language to another. This study investigated the influence of first language (L1) lexicalization patterns on the processing of second language (L2) words in sentential contexts by both less proficient and more proficient Persian learners of English. The focus was on cases where two different senses of a polys...
The Role of Lexical Resources in CJK Natural Language Processing
The role of lexical resources is often understated in NLP research. The complexity of Chinese, Japanese and Korean (CJK) poses special challenges to developers of NLP tools, especially in the area of word segmentation (WS), information retrieval (IR), named entity extraction (NER), and machine translation (MT). These difficulties are exacerbated by the lack of comprehensive lexical resources, e...
The Contribution of Lexical Resources to Natural Language Processing of CJK Languages
The role of lexical resources is often understated in NLP research. The complexity of Chinese, Japanese and Korean (CJK) poses special challenges to developers of NLP tools, especially in the area of word segmentation (WS), information retrieval (IR), named entity extraction (NER), and machine translation (MT). These difficulties are exacerbated by the lack of comprehensive lexical resources, e...
The Impact of Metalinguistic English Vocabulary Knowledge and Lexical Inferencing on EFL Learners’ Lexical Knowledge Considering the Cross-Linguistic Issue of L1 Lexicalization
The present study endeavors to unravel the enigma of the psycholinguistic mechanisms underpinning bilingual mental lexicon by analyzing the issue of L1 lexicalization as a construct epitomizing an overarching framework. It involves 78 juniors at the Islamic Azad University, Roudehen Branch. The study inspects the impact of the interventionist/noninterventionist treatments on both sets of lexica...
Normalized Accessor Variety Combined with Conditional Random Fields in Chinese Word Segmentation
The word is the basic unit in natural language processing (NLP), as it is at the lexical level upon which further processing rests. The lack of word delimiters such as spaces in Chinese texts makes Chinese word segmentation (CWS) an interesting yet challenging issue. This paper describes the in-depth research following our participation in the fourth International Chinese Language Processing ...